[LLaDA 2.0-Uni] Add LLaDA 2.0-Uni multimodal discrete diffusion pipeline#13686
ChinChyi wants to merge 2 commits into
Add UniLLaDA pipeline supporting text-to-image, image understanding, and image editing via block-wise iterative discrete diffusion.

New components:
- UniLLaDaPipeline: main pipeline (DiffusionPipeline subclass)
- LLaDA2UniImageTransformer2DModel: image transformer model
- LLaDA2UniFlowMatchEulerScheduler: flow matching scheduler
- ImageTokenizer: VQ image encoder helper
- Documentation and tests
| return torch.cat(result, dim=-1) | ||
| class LLaDA2UniImageTransformer2DModel(ModelMixin, ConfigMixin, PeftAdapterMixin, FromOriginalModelMixin): |
Is LLaDA2UniImageTransformer2DModel intended to be used as part of the UniLLaDA pipeline? I see that the transformer loaded by the pipeline is a remotely implemented transformers model (LLaDA2MoeModelLM in modeling_llada2uni_moe.py), and this transformer doesn't appear to be used anywhere.
| >>> model = AutoModelForCausalLM.from_pretrained( | ||
| ... model_id, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="auto" | ||
| ... ) |
I would prefer a diffusers-native (or transformers-native) implementation of the DiT model so that we don't need trust_remote_code=True.
There was a problem hiding this comment.
It's really OK to have trust_remote_code here. Ideally the model would be transformers-native, but that is up to them.
We are not going to host transformer models in diffusers.
| return_dict: bool, | ||
| ) -> UniLLaDaPipelineOutput | tuple: | ||
| """Text-to-image generation.""" | ||
| result = self.transformer.generate_image( |
I think the denoising loop should be implemented in UniLLaDaPipeline.__call__ using a scheduler (such as BlockRefinementScheduler), which is the standard diffusers design, rather than in transformer methods like generate_image.
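As a rough illustration of the layout suggested here, the sketch below keeps the loop in the pipeline's `__call__` and delegates the per-step token update to a scheduler, so the model only performs single forward passes. `ToyBlockScheduler`, `toy_model`, and `ToyPipeline` are hypothetical stand-ins written in plain Python. They are not the diffusers `BlockRefinementScheduler` API or the LLaDA 2.0-Uni model, and the confidence-based unmasking rule is an assumption chosen only for illustration.

```python
MASK = -1  # hypothetical mask token id


def toy_model(tokens):
    """Stand-in for one transformer forward pass.

    Returns a (predicted_token, confidence) pair per position. The values are
    deterministic functions of the position so the example is reproducible.
    """
    return [((i * 3) % 7, 1.0 / (1 + i % 4)) for i in range(len(tokens))]


class ToyBlockScheduler:
    """Hypothetical scheduler: commits the most confident masked positions."""

    def __init__(self, num_steps):
        self.num_steps = num_steps

    def step(self, tokens, predictions, step_index):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            return list(tokens)
        steps_left = self.num_steps - step_index
        n_commit = -(-len(masked) // steps_left)  # ceiling division
        best = sorted(masked, key=lambda i: predictions[i][1], reverse=True)
        out = list(tokens)
        for i in best[:n_commit]:
            out[i] = predictions[i][0]
        return out


class ToyPipeline:
    def __init__(self, model, scheduler):
        self.model = model
        self.scheduler = scheduler

    def __call__(self, seq_len, block_size):
        tokens = [MASK] * seq_len
        # Outer loop: one block at a time. Inner loop: refinement steps.
        for start in range(0, seq_len, block_size):
            end = min(start + block_size, seq_len)
            for step_index in range(self.scheduler.num_steps):
                predictions = self.model(tokens)  # single forward pass
                tokens[start:end] = self.scheduler.step(
                    tokens[start:end], predictions[start:end], step_index
                )
        return tokens


pipe = ToyPipeline(toy_model, ToyBlockScheduler(num_steps=4))
tokens = pipe(seq_len=10, block_size=4)
```

With this split, swapping the unmasking rule or the number of refinement steps only touches the scheduler, which is the main practical argument for keeping the loop out of the model class.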
| # ============================================================ | ||
| class ImageTokenizer: |
| class ImageTokenizer: | |
| class ImageTokenizer(ModelMixin, ConfigMixin): |
I think ImageTokenizer should inherit from ModelMixin and ConfigMixin (which is standard for diffusers models) so that saving and loading can be handled in the normal diffusers way, rather than needing to implement it separately in __init__ below.
| OPENAI_CLIP_STD = [0.26862954, 0.26130258, 0.27577711] | ||
| class ImagePreprocessor: |
I think we should refactor the image preprocessing logic in ImagePreprocessor into a dedicated VaeImageProcessor subclass that lives in its own file (e.g. image_processor.py). See for example JoyImageEditImageProcessor as a reference.
| attn_impl = getattr(self.config, "_attn_implementation", "eager") | ||
| if attn_impl != "eager" and attn_impl in ALL_ATTENTION_FUNCTIONS: | ||
| attention_interface = ALL_ATTENTION_FUNCTIONS[attn_impl] | ||
| if "flash" in attn_impl: | ||
| max_seqlen = (cu_seqlens[1:] - cu_seqlens[:-1]).max() | ||
| attn_output, _ = attention_interface( |
We should consider using dispatch_attention_fn instead as it handles the attention backends used here, such as Flash Attention (including flash_varlen) and torch native SDPA. For reference, see the attention backend docs.
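For context on the variable-length path in the snippet above: `cu_seqlens` holds the cumulative boundaries of sequences packed along a single dimension, so adjacent differences give the per-sequence lengths and their maximum is `max_seqlen`. A small plain-Python illustration with made-up lengths (the snippet does the same arithmetic with tensor slicing, `cu_seqlens[1:] - cu_seqlens[:-1]`):

```python
from itertools import accumulate

seq_lens = [3, 4, 5]  # made-up lengths of three packed sequences
cu_seqlens = [0] + list(accumulate(seq_lens))  # cumulative boundaries

# Adjacent differences recover the individual lengths.
recovered = [end - start for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:])]
max_seqlen = max(recovered)


def packed_index(seq_idx, token_idx):
    """Position of a token inside the packed (concatenated) buffer."""
    return cu_seqlens[seq_idx] + token_idx


print(cu_seqlens, recovered, max_seqlen)
```

A varlen attention kernel uses these boundaries to keep tokens from attending across sequence borders without padding every sequence to the same length.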
| return self.net(x) | ||
| class SigVQ(nn.Module): |
Is the SigVQ model intended to be used as part of the UniLLaDA pipeline? I don't see it being used anywhere.
| import PIL.Image | ||
| def generate_crop_size_list( |
Similar to #13686 (comment), I think we should refactor the image preprocessing logic here into a dedicated VaeImageProcessor subclass (possibly combined with the one from image_tokenizer.py).
dg845 left a comment
Thanks for the PR! I left an initial design review :).
What does this PR do?
Adds support for LLaDA 2.0-Uni, a unified multimodal discrete diffusion language model that supports text understanding, image understanding, and image generation in a single framework.
Paper: LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model
New Components
- LLaDA2UniImageTransformer2DModel: image diffusion transformer for decoding VQ tokens to images
- UniLLaDaPipeline: unified pipeline supporting three modes
- LLaDA2UniFlowMatchEulerScheduler: flow matching scheduler with Euler ODE integration

Key Features
Usage Example
Testing
tests/pipelines/unillada/test_unillada.py

Model Weights
Official weights available at: https://huggingface.co/inclusionAI/LLaDA2.0-Uni
Before submitting
Who can review?
@yiyixuxu @a-r-r-o-w @DN6